👋 Hello folks! Welcome to this exciting mini-project on Car Price Prediction!

In this project, we take a real-world dataset and walk through the entire Machine Learning pipeline, from data exploration to building regression models and improving results.

We begin with EDA (Exploratory Data Analysis) to understand how different features like engine size, horsepower, and mileage affect car prices. Then, we dive into data preprocessing, cleaning the data and preparing it for modeling.

We first apply Multiple Linear Regression to make predictions. To improve on that baseline, we then use Polynomial Regression (degree 2), which can capture non-linear patterns and reported a lower RMSE and a higher R² in our runs.

The project is built on the Car Price dataset, and everything is done using Python and popular libraries like pandas, matplotlib, seaborn, and scikit-learn.

Hope you enjoy the journey! 🚗📊✨

In [1]:
# Import the required Packages 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 
import sklearn 

Read the data using pandas

In [2]:
data = pd.read_csv("D:/NRIT Solutions/ML/Presentation/Linear Regression/Multi Linear Regression/Car Price Prediction Multiple Linear Regression/CarPrice_Assignment.csv")
data.head()
Out[2]:
car_ID symboling CarName fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase ... enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
0 1 3 alfa-romero giulia gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495.0
1 2 3 alfa-romero stelvio gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500.0
2 3 1 alfa-romero Quadrifoglio gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500.0
3 4 2 audi 100 ls gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950.0
4 5 2 audi 100ls gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450.0

5 rows × 26 columns

EDA

In [3]:
data_EDA = data.copy()
data_EDA.head()
Out[3]:
car_ID symboling CarName fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase ... enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
0 1 3 alfa-romero giulia gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495.0
1 2 3 alfa-romero stelvio gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500.0
2 3 1 alfa-romero Quadrifoglio gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500.0
3 4 2 audi 100 ls gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950.0
4 5 2 audi 100ls gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450.0

5 rows × 26 columns

In [9]:
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='fueltype',multiple='stack')
plt.ylabel("Count of Car Sales")
plt.xlabel("Price")
plt.title("Price vs Fueltype")
Out[9]:
Text(0.5, 1.0, 'Price vs Fueltype')
[Figure: stacked histogram of price by fueltype, with KDE curves]
In [10]:
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='aspiration',multiple='stack')
plt.ylabel("Count of Car Sales")
plt.xlabel("Price")
plt.title("Price vs Aspiration")
Out[10]:
Text(0.5, 1.0, 'Price vs Aspiration')
[Figure: stacked histogram of price by aspiration, with KDE curves]
In [11]:
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='doornumber',multiple='stack')
#plt.ylabel("Count of Car Sales")
#plt.xlabel("Price")
#plt.title("Price vs Doornumber")
Out[11]:
<Axes: xlabel='price', ylabel='Count'>
[Figure: stacked histogram of price by doornumber, with KDE curves]
In [12]:
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='carbody',multiple='stack')
#plt.ylabel("Count of Car Sales")
#plt.xlabel("Price")
#plt.title("Price vs Carbody")
Out[12]:
<Axes: xlabel='price', ylabel='Count'>
[Figure: stacked histogram of price by carbody, with KDE curves]
In [13]:
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='drivewheel',multiple='stack')
#plt.ylabel("Count of Car Sales")
#plt.xlabel("Price")
#plt.title("Price vs Drivewheel")
Out[13]:
<Axes: xlabel='price', ylabel='Count'>
[Figure: stacked histogram of price by drivewheel, with KDE curves]
In [14]:
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='enginelocation',multiple='stack')
#plt.ylabel("Count of Car Sales")
#plt.xlabel("Price")
#plt.title("Price vs Enginelocation")
Out[14]:
<Axes: xlabel='price', ylabel='Count'>
[Figure: stacked histogram of price by enginelocation, with KDE curves]
In [16]:
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='enginetype',multiple='stack')
#plt.ylabel("Count of Car Sales")
#plt.xlabel("Price")
#plt.title("Price vs Enginetype")
Out[16]:
<Axes: xlabel='price', ylabel='Count'>
[Figure: stacked histogram of price by enginetype, with KDE curves]
In [17]:
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='cylindernumber',multiple='stack')
#plt.ylabel("Count of Car Sales")
#plt.xlabel("Price")
#plt.title("Price vs Cylindernumber")
Out[17]:
<Axes: xlabel='price', ylabel='Count'>
[Figure: stacked histogram of price by cylindernumber, with KDE curves]
In [18]:
sns.set_style("darkgrid")
sns.histplot(data=data_EDA, x = 'price',kde=True,hue='fuelsystem',multiple='stack')
#plt.ylabel("Count of Car Sales")
#plt.xlabel("Price")
#plt.title("Price vs Fuelsystem")
Out[18]:
<Axes: xlabel='price', ylabel='Count'>
[Figure: stacked histogram of price by fuelsystem, with KDE curves]
In [24]:
#boxplot()
sns.boxplot(data=data_EDA,x='price',hue='aspiration')
Out[24]:
<Axes: xlabel='price'>
[Figure: box plot of price grouped by aspiration]
In [25]:
# Convert categorical columns to integer codes
from sklearn.preprocessing import LabelEncoder
lab_obj = LabelEncoder()

data_EDA["fueltype"] = lab_obj.fit_transform(data_EDA["fueltype"])
data_EDA["aspiration"] = lab_obj.fit_transform(data_EDA["aspiration"])
data_EDA["doornumber"] = lab_obj.fit_transform(data_EDA["doornumber"])
data_EDA["carbody"] = lab_obj.fit_transform(data_EDA["carbody"])
data_EDA["drivewheel"] = lab_obj.fit_transform(data_EDA["drivewheel"])
data_EDA["enginelocation"] = lab_obj.fit_transform(data_EDA["enginelocation"])
data_EDA["enginetype"] = lab_obj.fit_transform(data_EDA["enginetype"])
data_EDA["cylindernumber"] = lab_obj.fit_transform(data_EDA["cylindernumber"])
data_EDA["fuelsystem"] = lab_obj.fit_transform(data_EDA["fuelsystem"])


#Drop unwanted columns 
data_EDA = data_EDA.drop("car_ID",axis=1)
data_EDA = data_EDA.drop("CarName",axis=1)
data_EDA.head()
Out[25]:
symboling fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase carlength carwidth ... enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
0 3 1 0 1 0 2 0 88.6 168.8 64.1 ... 130 5 3.47 2.68 9.0 111 5000 21 27 13495.0
1 3 1 0 1 0 2 0 88.6 168.8 64.1 ... 130 5 3.47 2.68 9.0 111 5000 21 27 16500.0
2 1 1 0 1 2 2 0 94.5 171.2 65.5 ... 152 5 2.68 3.47 9.0 154 5000 19 26 16500.0
3 2 1 0 0 3 1 0 99.8 176.6 66.2 ... 109 5 3.19 3.40 10.0 102 5500 24 30 13950.0
4 2 1 0 0 3 0 0 99.4 176.6 66.4 ... 136 5 3.19 3.40 8.0 115 5500 18 22 17450.0

5 rows × 24 columns

In [30]:
import matplotlib.pyplot as plt
import seaborn as sns

# Compute correlation matrix
correlation_matrix = data_EDA.corr()

# Set up the matplotlib figure
plt.figure(figsize=(14, 10))  # Increased size

# Create the heatmap
sns.heatmap(
    correlation_matrix,
    annot=True,              # Show correlation numbers
    fmt=".2f",               # Format as decimal
    cmap="coolwarm",         # Color map
    linewidths=0.5,          # Line between boxes
    annot_kws={"size": 8}    # Smaller font size inside boxes
)

# Rotate axis labels for better readability
plt.xticks(rotation=45, ha='right', fontsize=9)
plt.yticks(rotation=0, fontsize=9)

# Title
plt.title("Correlation Heatmap of Numerical Features", fontsize=14)

# Show plot
plt.tight_layout()
plt.show()
[Figure: annotated correlation heatmap of all columns in data_EDA]
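A heatmap with this many columns can be hard to read for a single target. As a complementary sketch, the correlations with `price` alone can be ranked by absolute value. The frame below is synthetic (the `demo` columns and their relationships are made up), so only the technique carries over to `data_EDA`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: price rises with enginesize, citympg falls with it.
rng = np.random.default_rng(0)
enginesize = rng.uniform(60, 330, size=100)
demo = pd.DataFrame({
    "enginesize": enginesize,
    "citympg": 50 - 0.1 * enginesize + rng.normal(0, 2, size=100),
    "price": 100 * enginesize + rng.normal(0, 2000, size=100),
})

# Correlation of every feature with price, strongest (by magnitude) first.
price_corr = (
    demo.corr()["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)
print(price_corr)
```

On the real data, the same expression with `data_EDA` in place of `demo` would list which features move most closely with price.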
In [29]:
#pairplot()
sns.pairplot(data_EDA)
Out[29]:
<seaborn.axisgrid.PairGrid at 0x218e010a960>
[Figure: pair plot of all columns in data_EDA]
In [56]:
#pairplot()
sns.pairplot(data_EDA,hue='fueltype')
Out[56]:
<seaborn.axisgrid.PairGrid at 0x218fcc17350>
[Figure: pair plot of data_EDA colored by fueltype]

Data Preprocessing

In [31]:
# Checking for NaN values
data.isnull().sum()
Out[31]:
car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64
Our data has no NaN values. If it did, we would need to apply a missing-value imputation technique.
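For reference, here is a minimal sketch of what simple imputation could look like. It is not needed for this dataset; the frame `df` and its gaps are made up, and median/mode filling is just one common choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values (the real dataset has none).
df = pd.DataFrame({
    "horsepower": [111.0, np.nan, 154.0, 102.0],
    "fueltype": ["gas", "gas", None, "diesel"],
})

# Numeric column: fill gaps with the median of the observed values.
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].median())

# Categorical column: fill gaps with the most frequent category.
df["fueltype"] = df["fueltype"].fillna(df["fueltype"].mode()[0])

print(df.isnull().sum())
```

When a train/test split is involved, the fill values would normally be computed from the training rows only.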
In [32]:
# Basic statistics for our data
data.describe()
Out[32]:
car_ID symboling fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase carlength ... enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
count 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 ... 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000
mean 103.000000 0.834146 0.902439 0.180488 0.439024 2.614634 1.326829 0.014634 98.756585 174.049268 ... 126.907317 3.253659 3.329756 3.255415 10.142537 104.117073 5125.121951 25.219512 30.751220 13276.710571
std 59.322565 1.245307 0.297446 0.385535 0.497483 0.859081 0.556171 0.120377 6.021776 12.337289 ... 41.642693 2.013204 0.270844 0.313597 3.972040 39.544167 476.985643 6.542142 6.886443 7988.852332
min 1.000000 -2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 86.600000 141.100000 ... 61.000000 0.000000 2.540000 2.070000 7.000000 48.000000 4150.000000 13.000000 16.000000 5118.000000
25% 52.000000 0.000000 1.000000 0.000000 0.000000 2.000000 1.000000 0.000000 94.500000 166.300000 ... 97.000000 1.000000 3.150000 3.110000 8.600000 70.000000 4800.000000 19.000000 25.000000 7788.000000
50% 103.000000 1.000000 1.000000 0.000000 0.000000 3.000000 1.000000 0.000000 97.000000 173.200000 ... 120.000000 5.000000 3.310000 3.290000 9.000000 95.000000 5200.000000 24.000000 30.000000 10295.000000
75% 154.000000 2.000000 1.000000 0.000000 1.000000 3.000000 2.000000 0.000000 102.400000 183.100000 ... 141.000000 5.000000 3.580000 3.410000 9.400000 116.000000 5500.000000 30.000000 34.000000 16503.000000
max 205.000000 3.000000 1.000000 1.000000 1.000000 4.000000 2.000000 1.000000 120.900000 208.100000 ... 326.000000 7.000000 3.940000 4.170000 23.000000 288.000000 6600.000000 49.000000 54.000000 45400.000000

8 rows × 25 columns

In [33]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   car_ID            205 non-null    int64  
 1   symboling         205 non-null    int64  
 2   CarName           205 non-null    object 
 3   fueltype          205 non-null    int64  
 4   aspiration        205 non-null    int64  
 5   doornumber        205 non-null    int64  
 6   carbody           205 non-null    int64  
 7   drivewheel        205 non-null    int64  
 8   enginelocation    205 non-null    int64  
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64  
 14  enginetype        205 non-null    int64  
 15  cylindernumber    205 non-null    int64  
 16  enginesize        205 non-null    int64  
 17  fuelsystem        205 non-null    int64  
 18  boreratio         205 non-null    float64
 19  stroke            205 non-null    float64
 20  compressionratio  205 non-null    float64
 21  horsepower        205 non-null    int64  
 22  peakrpm           205 non-null    int64  
 23  citympg           205 non-null    int64  
 24  highwaympg        205 non-null    int64  
 25  price             205 non-null    float64
dtypes: float64(8), int64(17), object(1)
memory usage: 41.8+ KB
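Several columns (such as fueltype, carbody, and drivewheel) hold categorical data, so we need to convert them to numbers with a feature-encoding technique. Note that the data.describe() and data.info() outputs above already show these columns as int64, which suggests they were captured after the encoding cell below had been run.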
In [34]:
# Convert categorical columns to integer codes
from sklearn.preprocessing import LabelEncoder
lab_obj = LabelEncoder()

data["fueltype"] = lab_obj.fit_transform(data["fueltype"])
data["aspiration"] = lab_obj.fit_transform(data["aspiration"])
data["doornumber"] = lab_obj.fit_transform(data["doornumber"])
data["carbody"] = lab_obj.fit_transform(data["carbody"])
data["drivewheel"] = lab_obj.fit_transform(data["drivewheel"])
data["enginelocation"] = lab_obj.fit_transform(data["enginelocation"])
data["enginetype"] = lab_obj.fit_transform(data["enginetype"])
data["cylindernumber"] = lab_obj.fit_transform(data["cylindernumber"])
data["fuelsystem"] = lab_obj.fit_transform(data["fuelsystem"])
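`LabelEncoder` assigns integers in sorted category order, so a linear model will treat, for example, the `carbody` codes 0 to 4 as an ordered scale. One common alternative for nominal columns is one-hot encoding. The sketch below assumes `pandas.get_dummies` and a small made-up `sample` frame; it is an illustration, not what this notebook does.

```python
import pandas as pd

# Tiny hypothetical sample mimicking two of the dataset's categorical columns.
sample = pd.DataFrame({
    "carbody": ["convertible", "hatchback", "sedan", "sedan"],
    "drivewheel": ["rwd", "rwd", "fwd", "4wd"],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

# One-hot encode: each category becomes its own 0/1 column,
# so no artificial ordering is implied between categories.
encoded = pd.get_dummies(sample, columns=["carbody", "drivewheel"], dtype=int)
print(encoded.columns.tolist())
```

The trade-off is a wider feature matrix, which matters on a dataset with only 205 rows.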
In [35]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   car_ID            205 non-null    int64  
 1   symboling         205 non-null    int64  
 2   CarName           205 non-null    object 
 3   fueltype          205 non-null    int64  
 4   aspiration        205 non-null    int64  
 5   doornumber        205 non-null    int64  
 6   carbody           205 non-null    int64  
 7   drivewheel        205 non-null    int64  
 8   enginelocation    205 non-null    int64  
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64  
 14  enginetype        205 non-null    int64  
 15  cylindernumber    205 non-null    int64  
 16  enginesize        205 non-null    int64  
 17  fuelsystem        205 non-null    int64  
 18  boreratio         205 non-null    float64
 19  stroke            205 non-null    float64
 20  compressionratio  205 non-null    float64
 21  horsepower        205 non-null    int64  
 22  peakrpm           205 non-null    int64  
 23  citympg           205 non-null    int64  
 24  highwaympg        205 non-null    int64  
 25  price             205 non-null    float64
dtypes: float64(8), int64(17), object(1)
memory usage: 41.8+ KB
In [36]:
data.head()
Out[36]:
car_ID symboling CarName fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase ... enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
0 1 3 alfa-romero giulia 1 0 1 0 2 0 88.6 ... 130 5 3.47 2.68 9.0 111 5000 21 27 13495.0
1 2 3 alfa-romero stelvio 1 0 1 0 2 0 88.6 ... 130 5 3.47 2.68 9.0 111 5000 21 27 16500.0
2 3 1 alfa-romero Quadrifoglio 1 0 1 2 2 0 94.5 ... 152 5 2.68 3.47 9.0 154 5000 19 26 16500.0
3 4 2 audi 100 ls 1 0 0 3 1 0 99.8 ... 109 5 3.19 3.40 10.0 102 5500 24 30 13950.0
4 5 2 audi 100ls 1 0 0 3 0 0 99.4 ... 136 5 3.19 3.40 8.0 115 5500 18 22 17450.0

5 rows × 26 columns

In [37]:
#Drop unwanted columns 
data = data.drop("car_ID",axis=1)
data = data.drop("CarName",axis=1)
data.head()
Out[37]:
symboling fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase carlength carwidth ... enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
0 3 1 0 1 0 2 0 88.6 168.8 64.1 ... 130 5 3.47 2.68 9.0 111 5000 21 27 13495.0
1 3 1 0 1 0 2 0 88.6 168.8 64.1 ... 130 5 3.47 2.68 9.0 111 5000 21 27 16500.0
2 1 1 0 1 2 2 0 94.5 171.2 65.5 ... 152 5 2.68 3.47 9.0 154 5000 19 26 16500.0
3 2 1 0 0 3 1 0 99.8 176.6 66.2 ... 109 5 3.19 3.40 10.0 102 5500 24 30 13950.0
4 2 1 0 0 3 0 0 99.4 176.6 66.4 ... 136 5 3.19 3.40 8.0 115 5500 18 22 17450.0

5 rows × 24 columns

In [38]:
#Divide the features into independent and dependent variables in terms of X and Y
x = data.iloc[:, 0:-1]
y = data.iloc[:, [-1]]
In [39]:
# Feature scaling (note: both scalers are fitted on the full dataset, before the train/test split)
from sklearn.preprocessing import MinMaxScaler
x_scl = MinMaxScaler()
y_scl = MinMaxScaler()

x = x_scl.fit_transform(x)
y = y_scl.fit_transform(y)
In [40]:
# Splitting data into train and test sets
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
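Since the scalers above are fitted before the split, the test rows influence the scaling parameters. A common way to avoid that is to split first and fit the scaler on the training rows only. The following is a sketch on synthetic arrays (`X_demo`, `y_demo` are made up), not a rerun of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-ins for the feature matrix and the price column.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 100, size=(50, 3))
y_demo = rng.uniform(5000, 45000, size=(50, 1))

# 1) Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 2) ... then learn min/max from the training rows only
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)

# 3) ... and reuse those parameters on the test rows.
X_te_scaled = scaler.transform(X_te)
```

With this order, scaled test values may fall slightly outside the 0 to 1 range, which is expected.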

Linear Model Building

In [41]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
In [42]:
# Fit the model (note: x and y hold the full dataset, so the test rows are also seen during training)
reg.fit(x,y)
Out[42]:
LinearRegression()
In [43]:
reg.score(x,y)
Out[43]:
0.8802434927425343
In [44]:
#making predictions 
y_predict = reg.predict(x_test)
In [45]:
from sklearn.metrics import root_mean_squared_error
RMSE = root_mean_squared_error(y_pred=y_predict,y_true=y_test)
RMSE
Out[45]:
0.0843969173808874
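The R² and RMSE above come from a model fitted with `reg.fit(x, y)`, that is, on all rows including the test split. A sketch of the more usual protocol, fitting on the training split and scoring on the held-out split, is shown below on synthetic data (`X_demo`, `y_demo`, and the coefficients are made up). `np.sqrt(mean_squared_error(...))` is used here as an equivalent of `root_mean_squared_error` that also runs on older scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic linear data: four features, known coefficients, small noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(0, 0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the training rows only.
lin = LinearRegression().fit(X_tr, y_tr)

# Report train and test scores separately to spot overfitting.
train_r2 = lin.score(X_tr, y_tr)
test_r2 = lin.score(X_te, y_te)
test_rmse = np.sqrt(mean_squared_error(y_te, lin.predict(X_te)))
print(train_r2, test_r2, test_rmse)
```

Also note that the notebook's RMSE of about 0.084 is in MinMax-scaled units. Multiplying by the price range from `data.describe()` (45400 - 5118 = 40282) gives roughly 3,400 in original price units.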

Polynomial Regression

In [46]:
import pandas as pd
import numpy as np
In [47]:
data = pd.read_csv("D:/NRIT Solutions/ML/Presentation/Linear Regression/Multi Linear Regression/Car Price Prediction Multiple Linear Regression/CarPrice_Assignment.csv")
In [48]:
# Convert categorical columns to integer codes
from sklearn.preprocessing import LabelEncoder
lab_obj = LabelEncoder()

data["fueltype"] = lab_obj.fit_transform(data["fueltype"])
data["aspiration"] = lab_obj.fit_transform(data["aspiration"])
data["doornumber"] = lab_obj.fit_transform(data["doornumber"])
data["carbody"] = lab_obj.fit_transform(data["carbody"])
data["drivewheel"] = lab_obj.fit_transform(data["drivewheel"])
data["enginelocation"] = lab_obj.fit_transform(data["enginelocation"])
data["enginetype"] = lab_obj.fit_transform(data["enginetype"])
data["cylindernumber"] = lab_obj.fit_transform(data["cylindernumber"])
data["fuelsystem"] = lab_obj.fit_transform(data["fuelsystem"])


#Drop unwanted columns 
data = data.drop("car_ID",axis=1)
data = data.drop("CarName",axis=1)

#Divide the features into independent and dependent variables in terms of X and Y
x = data.iloc[:, 0:-1]
y = data.iloc[:, [-1]]
In [49]:
# Feature scaling (note: both scalers are fitted on the full dataset, before the train/test split)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
x_scl = StandardScaler()
y_scl = StandardScaler()

x = x_scl.fit_transform(x)
y = y_scl.fit_transform(y)
In [50]:
# Splitting data into train and test sets
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
In [51]:
# Create polynomial features (degree=2)
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=2)
x = poly.fit_transform(x)
x_poly_train = poly.transform(x_train)
x_poly_test = poly.transform(x_test)

# Fit linear regression on the polynomial features
# (note: x and y hold the full dataset, so the test rows are included in training)
model = LinearRegression()
model.fit(x, y)
Out[51]:
LinearRegression()
In [52]:
model.score(x, y)
Out[52]:
0.9987438222456763
In [53]:
model.score(x_poly_test, y_test)
Out[53]:
0.9995488048241306
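Scores this close to 1 deserve a sanity check. One plausible contributor, besides the test rows being part of the fit, is the size of the expanded feature matrix: a degree-2 expansion of the 23 input features produces more columns than the dataset has rows. The sketch below only counts the columns, using a zero-filled placeholder array of the same shape as `x` (205 rows, 23 features).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_rows, n_features = 205, 23

# Placeholder array: only its shape matters for counting output columns.
placeholder = np.zeros((n_rows, n_features))

# 1 bias + 23 linear + 23 squared + 253 pairwise-interaction terms.
n_poly = PolynomialFeatures(degree=2).fit(placeholder).n_output_features_
print(n_poly, "polynomial columns for", n_rows, "rows")
```

With more columns than rows, ordinary least squares can fit the training rows almost exactly, so a high score on rows the model has already seen says little about generalization.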
In [54]:
# Make predictions on the polynomial test features
y_pred = model.predict(x_poly_test)
In [55]:
from sklearn.metrics import root_mean_squared_error
RMSE = root_mean_squared_error(y_true=y_test,y_pred=y_pred)
RMSE
Out[55]:
0.023682050252017823
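The two RMSE values in this notebook are measured on different target scales (`MinMaxScaler` for the linear model, `StandardScaler` here), and both models were fitted on the full dataset, so the numbers are not directly comparable. A sketch of a like-for-like comparison follows. It assumes scikit-learn pipelines, an unscaled target, and synthetic data with a known quadratic relationship (`X_demo`, `y_demo` are made up), so the scores it prints say nothing about the car dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with squared and interaction terms plus small noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(200, 2))
y_demo = (
    1.5 * X_demo[:, 0] ** 2
    - X_demo[:, 0] * X_demo[:, 1]
    + 0.5 * X_demo[:, 1]
    + rng.normal(0, 0.1, size=200)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Pipelines fit the scaler and the feature expansion on the training rows only.
linear_model = make_pipeline(StandardScaler(), LinearRegression())
poly_model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)

# Same split, same target units, same metric for both models.
results = {}
for name, est in [("linear", linear_model), ("polynomial", poly_model)]:
    est.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, est.predict(X_te)))
    results[name] = {"test_r2": est.score(X_te, y_te), "test_rmse": rmse}

print(results)
```

Applied to the car data, the same structure would report both models' errors in price units on rows neither model has seen.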